Thematic clustering of text documents using an EM-based approach
نویسندگان
چکیده
منابع مشابه
Thematic clustering of text documents using an EM-based approach
Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans in general since it canno...
متن کاملUsing an Evolving Thematic Clustering in a Text Segmentation Process
The thematic text segmentation task consists in identifying the most important thematic breaks in a document in order to cut it into homogeneous passages. We propose in this paper an algorithm for linear text segmentation on general corpuses. It relies on an initial clustering of the sentences of the text. This preliminary partitioning provides a global view on the sentences relations existing ...
متن کاملClustering Full Text Documents
An index or topic hierarchy of full-text documents can organize a domain and speed information retrieval. Traditional indexes, like the Library of Congress system or Dewey Decimal system, are generated by hand, updated infrequently, and applied inconsistently. With machine learning, they can be generated automatically, updated as new documents arrive, and applied consistently. Despite the appea...
متن کاملDiscriminative Clustering of Text Documents
Vector-space and distributional methods for text document clustering are discussed. Discriminative clustering, a recently proposed method, uses external data to find taskrelevant characteristics of the documents, yet the clustering is defined even with no external data. We introduce a distributional version of discriminative clustering that represents text documents as probability distributions...
متن کاملUsing EM to Classify Text from Labeled and Unlabeled Documents
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is significant because in many important text classification problems obtaining classification labels is expensive, while large quantities of unlabeled documents are readily available. We present a theoretical ar...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
ژورنال
عنوان ژورنال: Journal of Biomedical Semantics
سال: 2012
ISSN: 2041-1480
DOI: 10.1186/2041-1480-3-s3-s6